{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk import CFG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 文法 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "groucho_grammar = CFG.fromstring(\"\"\"\n",
    "S -> NP VP\n",
    "PP -> P NP\n",
    "NP -> Det N | Det N PP | 'I'\n",
    "VP -> V NP | VP PP\n",
    "Det -> 'an' | 'my'\n",
    "N -> 'elephant' | 'pajamas'\n",
    "V -> 'shot'\n",
    "P -> 'in'\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP I)\n",
      "  (VP\n",
      "    (VP (V shot) (NP (Det an) (N elephant)))\n",
      "    (PP (P in) (NP (Det my) (N pajamas)))))\n",
      "(S\n",
      "  (NP I)\n",
      "  (VP\n",
      "    (V shot)\n",
      "    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))\n"
     ]
    }
   ],
   "source": [
    "sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']\n",
    "parser = nltk.ChartParser(groucho_grammar)\n",
    "trees = parser.parse(sent)\n",
    "for tree in trees:\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 上下文无关文法（CFG）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "grammar = CFG.fromstring(\"\"\"\n",
    "S -> NP VP\n",
    "VP -> V NP | V NP PP\n",
    "PP -> P NP\n",
    "V -> 'saw' | 'ate' | 'walked'\n",
    "NP -> 'john' | 'Mary' | 'Bob' | Det N | Det N PP \n",
    "Det -> 'a' | 'an' | 'the' | 'my'\n",
    "N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'\n",
    "P -> 'in' | 'on' | 'by' | 'with'\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sent = 'Mary saw Bob'.split()\n",
    "sent = 'the dog saw a man in the park'.split()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP (Det the) (N dog))\n",
      "  (VP\n",
      "    (V saw)\n",
      "    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))\n",
      "(S\n",
      "  (NP (Det the) (N dog))\n",
      "  (VP\n",
      "    (V saw)\n",
      "    (NP (Det a) (N man))\n",
      "    (PP (P in) (NP (Det the) (N park)))))\n"
     ]
    }
   ],
   "source": [
    "rd_parser = nltk.RecursiveDescentParser(grammar)\n",
    "for tree in rd_parser.parse(sent):\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用自己的文法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "grammar = nltk.data.load('file:mygrammar.cfg')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S (NP Mary) (VP (V saw) (NP Bob)))\n"
     ]
    }
   ],
   "source": [
    "sent = 'Mary saw Bob'.split()\n",
    "rd_parser = nltk.RecursiveDescentParser(grammar)\n",
    "for tree in rd_parser.parse(sent):\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RecursiveDescentParser 递归下降分析器：\n",
    "\n",
    "- 左递归产生式，如 NP -> NP PP ，会进行死循环\n",
    "- 浪费了很多时间处理不符合输入句子的词和结构\n",
    "- 回溯过程中可能会丢弃分析过的成分，它们将需要在之后再次重建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))\n"
     ]
    }
   ],
   "source": [
    "sr_parse = nltk.ShiftReduceParser(grammar)\n",
    "for tree in sr_parse.parse(sent):\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parsing 'Mary saw a dog'\n",
      "    [ * Mary saw a dog]\n",
      "  S [ 'Mary' * saw a dog]\n",
      "  R [ NP * saw a dog]\n",
      "  S [ NP 'saw' * a dog]\n",
      "  R [ NP V * a dog]\n",
      "  S [ NP V 'a' * dog]\n",
      "  R [ NP V Det * dog]\n",
      "  S [ NP V Det 'dog' * ]\n",
      "  R [ NP V Det N * ]\n",
      "  R [ NP V NP * ]\n",
      "  R [ NP VP * ]\n",
      "  R [ S * ]\n",
      "(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))\n"
     ]
    }
   ],
   "source": [
    "sr_parse = nltk.ShiftReduceParser(grammar, trace=2)\n",
    "for tree in sr_parse.parse(sent):\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ShiftReduceParser 移进-规约分析器：\n",
    "\n",
    "- 不执行任何回溯，所以不能保证一定能找到一个文本的解析（即使真的存在）\n",
    "- 即使有多个解析最多也只能找到一个\n",
    "- 每个结构只建立一次\n",
    "\n",
    "通过优先执行规约操作来解决移进-规约冲突"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 交互式文法编辑器\n",
    "nltk.app.chartparser()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 符合语法规则的子串表（WFST）\n",
    "利用动态规划思想，结合了以上两种分析器\n",
    "\n",
    "具体见书：270页左右"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 依存文法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "groucho_dep_grammar = nltk.grammar.DependencyGrammar.fromstring(\"\"\"\n",
    "'shot' -> 'I' | 'elephant' | 'in'\n",
    "'elephant' -> 'an' | 'in'\n",
    "'in' -> 'pajamas'\n",
    "'pajamas' -> 'my'\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dependency grammar with 7 productions\n",
      "  'shot' -> 'I'\n",
      "  'shot' -> 'elephant'\n",
      "  'shot' -> 'in'\n",
      "  'elephant' -> 'an'\n",
      "  'elephant' -> 'in'\n",
      "  'in' -> 'pajamas'\n",
      "  'pajamas' -> 'my'\n"
     ]
    }
   ],
   "source": [
    "print(groucho_dep_grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(shot I (elephant an (in (pajamas my))))\n",
      "(shot I (elephant an) (in (pajamas my)))\n"
     ]
    }
   ],
   "source": [
    "pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)\n",
    "sent = 'I shot an elephant in my pajamas'.split()\n",
    "trees = pdp.parse(sent)\n",
    "for tree in trees:\n",
    "    print(tree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 特征结构"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ A = 'a'             ]\n",
      "[                     ]\n",
      "[ B = (1) [ C = 'c' ] ]\n",
      "[                     ]\n",
      "[ D -> (1)            ]\n",
      "[ E -> (1)            ]\n"
     ]
    }
   ],
   "source": [
    "print(nltk.FeatStruct('''\n",
    "[A = 'a',\n",
    "B = (1)[C = 'c'],\n",
    "D -> (1),\n",
    "E -> (1)]\n",
    "'''))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ CITY    = 'Paris'      ]\n",
      "[ NUMBER  = 74           ]\n",
      "[ STAREET = 'run Pascal' ]\n"
     ]
    }
   ],
   "source": [
    "fs1 = nltk.FeatStruct(NUMBER = 74,\n",
    "                     STAREET = 'run Pascal')\n",
    "fs2 = nltk.FeatStruct(CITY = 'Paris')\n",
    "\n",
    "print(fs1.unify(fs2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "fs0 = nltk.FeatStruct(A = 'a')\n",
    "fs1 = nltk.FeatStruct(A = 'b')\n",
    "\n",
    "fs2 = fs0.unify(fs1)\n",
    "print(fs2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ ADDRESS1 = [ NUMBER = 74           ] ]\n",
      "[            [ STREET = 'run Pascal' ] ]\n",
      "[ ADDRESS1 = ?x ]\n",
      "[ ADDRESS2 = ?x ]\n"
     ]
    }
   ],
   "source": [
    "fs1 = nltk.FeatStruct('''\n",
    "[ADDRESS1 = [NUMBER = 74, STREET = 'run Pascal']]\n",
    "''')\n",
    "fs2 = nltk.FeatStruct('''\n",
    "[ADDRESS1 = ?x,\n",
    "ADDRESS2 = ?x]\n",
    "''')\n",
    "print(fs1)\n",
    "print(fs2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ ADDRESS1 = (1) [ NUMBER = 74           ] ]\n",
      "[                [ STREET = 'run Pascal' ] ]\n",
      "[                                          ]\n",
      "[ ADDRESS2 -> (1)                          ]\n"
     ]
    }
   ],
   "source": [
    "print(fs2.unify(fs1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
